Factorized Deep Neural Networks for Adaptive Speech Recognition
Authors
Abstract
Recently, we have shown that context-dependent deep neural network hidden Markov models (CD-DNN-HMMs) can achieve very promising recognition results on large vocabulary speech recognition tasks, as evidenced by over one-third fewer word errors than discriminatively trained conventional HMM-based systems on the 300-hour Switchboard benchmark task. In this paper, we propose and describe two types of factorized adaptive DNNs that improve on the earlier versions of CD-DNN-HMMs. In the first model, the hidden speaker and environment factors and the tied triphone states are jointly approximated, while in the second model, the factors are first estimated and then fed into the main DNN to predict tied triphone states. We evaluated these models on the small 30-hour Switchboard task. The preliminary results indicate that more training data are needed to show the full potential of these models. However, these models provide new ways of modeling speaker and environment factors and offer insight into how environment-invariant DNN models may be constructed and subsequently trained.
Index Terms: automatic speech recognition, deep neural networks, factorized DNN, CD-DNN-HMM
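The second model in the abstract (factors estimated first, then fed into the main DNN) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation. The abstract does not say how the factors are estimated or how they enter the main network, so estimating them from the acoustic frame and concatenating them with the input are assumptions. The layer sizes, the random untrained weights, and names such as `factor_net` and `state_posteriors` are hypothetical.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights; a real system would learn these from data."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


# Hypothetical sizes, chosen only to keep the example small.
N_FEATS, N_FACTORS, N_HIDDEN, N_STATES = 8, 3, 6, 5

rng = random.Random(0)
factor_net = init_layer(N_FEATS, N_FACTORS, rng)
hidden_layer = init_layer(N_FEATS + N_FACTORS, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_STATES, rng)


def state_posteriors(frame):
    # Step 1 (assumption): estimate speaker/environment factors from the frame.
    factors = sigmoid(dense(frame, *factor_net))
    # Step 2 (assumption): feed the factors to the main DNN by concatenation.
    hidden = sigmoid(dense(frame + factors, *hidden_layer))
    # Step 3: posterior distribution over tied triphone states.
    return softmax(dense(hidden, *output_layer))


frame = [rng.gauss(0.0, 1.0) for _ in range(N_FEATS)]
posteriors = state_posteriors(frame)
```

The returned list is a probability distribution over the (here, five) tied triphone states. In a hybrid system, posteriors like these would normally be converted to scaled likelihoods before HMM decoding, which this sketch leaves out.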
Similar resources
Convolutional Neural Network with Adaptive Windows for Speech Recognition
Although speech recognition systems are widely used and their accuracy continues to improve, there is still a considerable gap between their accuracy and human recognition ability. This is partly due to high speaker variation in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, using hybrid deep neural network and hidden Markov mo...
Deep Neural Factorization for Speech Recognition
A conventional speech recognition system is constructed by unfolding the spectral-temporal input matrices into one-way vectors and using these vectors to estimate the affine parameters of a neural network according to the vector-based error backpropagation algorithm. System performance is constrained because the contextual correlations in frequency and time horizons are disregarded and the spectral...
I-vector dependent feature space transformations for adaptive speech recognition
In this paper, we propose a new feature normalization approach for deep neural network (DNN) based adaptive speech recognition. Each speaker is represented by an i-vector, and the i-vector dependent block-diagonal transformation matrix is obtained by a tensor and performed on the input features. The parameters of the tensor are shared by all the frames in the input window, and factorized into thre...
Speech Emotion Recognition Using Scalogram Based Deep Structure
Speech Emotion Recognition (SER) is an important part of speech-based Human-Computer Interface (HCI) applications. Previous SER methods rely on the extraction of features and training an appropriate classifier. However, most of those features can be affected by emotionally irrelevant factors such as gender, speaking styles and environment. Here, an SER method has been proposed based on a concat...
Introducing Deep Modular Neural Networks with a Dual Spatio-Temporal Structure to Improve Persian Continuous Speech Recognition
In this article, growable deep modular neural networks for continuous speech recognition are introduced. These networks can be grown to implement the spatio-temporal information of the frame sequences at their input layer as well as their labels at the output layer at the same time. The trained neural network with such double spatio-temporal association structure can learn the phonetic sequence...
Publication date: 2012